{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a4237e23-cc43-4ac1-9b4c-d4ab23a30381",
   "metadata": {},
   "source": [
    "# CPU Acceleration\n",
    "\n",
    "[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/rl-tools/documentation/binder?labpath=04-CPU%20Acceleration.ipynb)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fcd1864e-c0f1-4351-aea2-cc6e3e9d41a9",
   "metadata": {},
   "source": [
    "The generic implementation of matrix multiplication using a triple nested loop is reasonably fast due to the design of **RLtools** allowing the compiler to heavily optimizer the code (especially for smaller matrices). For larger matrices it is beneficial to use the CPUs [SIMD](https://en.wikipedia.org/wiki/Single_instruction,_multiple_data) instructions (like e.g. AVX). This can be done through [BLAS](https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms) libraries like Intel MKL or OpenBLAS. We found Intel MKL to be the fastest, but it does not work reliably in Cling (the C/C++ interpreter as a backend of the Jupyter notebook used for this tutorial). Hence we use OpenBLAS in this and the following examples. **RLtools** has a multiplexing header `cpu_mux.h` that automatically includes the CPU operations dispatching to the available backend. The available backend is selected by defining e.g. `RL_TOOLS_BACKEND_ENABLE_OPENBLAS` or `RL_TOOLS_BACKEND_ENABLE_MKL` (these options can also be used during the configuration phase of CMake when using **RLtools** natively by passing them on the CLI using `-D`). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "3af7aa9c-9410-4f6a-b4eb-ce01cfe8ae5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS\n",
    "#include <rl_tools/operations/cpu_mux.h>\n",
    "namespace rlt = rl_tools;"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e6d409c1-fbce-4ae1-b8ad-844161270b5b",
   "metadata": {},
   "source": [
    "We also include `iostream` to print the computation times later on"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "3e23f27b-6567-4e45-ad0a-95269debf32f",
   "metadata": {},
   "outputs": [],
   "source": [
    "#include <iostream>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9a07b6cd-508c-419a-917d-5fa8b5711019",
   "metadata": {},
   "source": [
    "OpenBLAS contains pre-compiled matrix multiplication kernels and hence needs to be linked (or loaded in case of the Cling interpreter):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6b311f6e-db93-48c6-bd53-b78ce219ad83",
   "metadata": {},
   "outputs": [],
   "source": [
    "#pragma cling load(\"openblas\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f5d8b13e-ea82-42bd-b2b0-98bdfc59ae6e",
   "metadata": {},
   "source": [
    "We define two devices to compare the computation time later on. All `CPU_*` devices use the main memory to store containers and allocated matrices are hence compatible between them. For maximum performance it is recommended to allocate the matrices with the device that they are used on afterwards because e.g. Intel MKL allows to allocate chunks of memory at certain boundaries which allow faster loading using SIMD instructions. This is integrated into the particular `rlt::malloc` specialization which is dispatched to by simply calling with the desired device. \n",
    "\n",
    "Because we don't know the outcome of the (or rather we don't want to hard-code it) the `cpu_mux.h` returns a `DEVICE_FACTORY` template that generates the found `CPU_*` device when passing a `CPU` specification. In this case (since we know OpenBLAS is available and that `RL_TOOLS_BACKEND_ENABLE_OPENBLAS` is defined) this is equal to `using DEVICE_BLAS = rlt::devices::CPU_OPENBLAS<rlt::devices::DefaultCPUSpecification>`.\n",
    "\n",
    "You can play around with commenting `#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS` out in the first cell and re-running the notebook. In that case, you can see that it will use a normal `CPU` for `DEVICE_BLAS` and result in equally slow computation times for both devices in the benchmark lateron."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3f34afd1-a1e8-4028-83c2-40a364bb66f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "using DEVICE = rlt::devices::DefaultCPU;\n",
    "using DEVICE_BLAS = rlt::devices::DEVICE_FACTORY<rlt::devices::DefaultCPUSpecification>;\n",
    "using T = double;\n",
    "using TI = typename DEVICE::index_t;"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "72424853-a826-4ad4-970a-69ecb6ffc1d8",
   "metadata": {},
   "source": [
    "We allocate $A$, $B$, $C$ and $C_{blas}$ matrices to evaluate the computation:\n",
    "$$C = A \\cdot B$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "89b6b0e1-8818-4e2a-91e8-1c854bccd6f9",
   "metadata": {},
   "outputs": [],
   "source": [
    "DEVICE device;\n",
    "DEVICE_BLAS device_blas;\n",
    "constexpr TI SIZE = 500;\n",
    "rlt::Matrix<rlt::matrix::Specification<T, TI, SIZE, SIZE>> A, B, C, C_blas;\n",
    "auto rng = rlt::random::default_engine(typename DEVICE::SPEC::RANDOM(), 1);"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eb7d6f2d-bfce-424c-9358-8749f41bfb0a",
   "metadata": {},
   "source": [
    "We allocate all the matrices and fill $A$ and $B$ with random numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f2b829da-bf19-413a-85e8-5241a9286f7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "rlt::malloc(device_blas, A);\n",
    "rlt::malloc(device_blas, B);\n",
    "rlt::malloc(device_blas, C);\n",
    "rlt::malloc(device_blas, C_blas);\n",
    "rlt::randn(device, A, rng);\n",
    "rlt::randn(device, B, rng);"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f9b62d58-51af-4641-a17c-1bea1e3998c3",
   "metadata": {},
   "source": [
    "Now we can benchmark the matrix multiplication using the generic implementation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "ef6b6eaa-0b43-46dd-82a7-3feb36c2b50a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time Cling (C/C++ interpreter): 1.24116 seconds\n"
     ]
    }
   ],
   "source": [
    "{\n",
    "    auto start = std::chrono::high_resolution_clock::now();\n",
    "    rlt::multiply(device, A, B, C);\n",
    "    auto end = std::chrono::high_resolution_clock::now();\n",
    "    auto duration = std::chrono::duration<double>(end - start);\n",
    "    std::cout << \"Time Cling (C/C++ interpreter): \" << duration.count() << \" seconds\" << std::endl;\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e5c5b55a-8857-4490-92e6-d77dfb1f4fad",
   "metadata": {},
   "source": [
    "Equivalently we can run the same computation using OpenBLAS and get the result in much less time:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "55bed031-7e0c-4dd8-b3a7-9b50b7deb5db",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time Cling (C/C++ interpreter): 0.00617779 seconds\n"
     ]
    }
   ],
   "source": [
    "{\n",
    "    auto start = std::chrono::high_resolution_clock::now();\n",
    "    rlt::multiply(device_blas, A, B, C_blas);\n",
    "    auto end = std::chrono::high_resolution_clock::now();\n",
    "    auto duration = std::chrono::duration<double>(end - start);\n",
    "    std::cout << \"Time Cling (C/C++ interpreter): \" << duration.count() << \" seconds\" << std::endl;\n",
    "}"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bbb590d2-2675-4f5e-8aa7-491ebb38e633",
   "metadata": {},
   "source": [
    "Now we check the resulting matrices against each other:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "800a7a29-ec60-4cbd-98d7-2a524d5bb328",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Absolute difference between the resulting matrices: 2.95702e-09\n"
     ]
    }
   ],
   "source": [
    "std::cout << \"Absolute difference between the resulting matrices: \" << rlt::abs_diff(device, C, C_blas) << std::endl;"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8f6de1a5-e6e8-442d-8922-3ea1bb67cdd1",
   "metadata": {},
   "source": [
    "Depending on the machine, compilers and library (versions) used, we might find that they are exactly equal but this is not necessarily the case. By changing `T` to `float` the deviations should be bigger but also for `double` this could happen because of floating point inaccuracies which entail that the same mathematical expression does not necessarily lead to the same result if you reorder the (finite precision) computations. \n",
    "\n",
    "Another sanity check is printing the matrices, which is infeasible for their full size, hence we only print a subview of size $5$. This can be done using the `rlt::view` operator which yields a submatrix that is a view of the original matrix. The view is an ordinary `rlt::Matrix` and hence can be used in all operations that take matrices as input. Since it is a view it is cheap (zero-copy) because it only carries a single pointer at runtime as well as compile-time information about the shape (dimensions and [pitch](https://en.wikipedia.org/wiki/Stride_of_an_array)).  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "05d59477-5092-495c-9322-ee08018b6bd4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Result comparison (top-left 5x5 for brevity)\n",
      "\n",
      "C: \n",
      "   27.437886   -55.120778    -2.961325    35.691352    -1.118319 \n",
      "   -9.361667    37.296970   -34.454112     5.538143     8.603580 \n",
      "    7.339524    -4.856299    -8.972083    30.239564    21.464767 \n",
      "  -41.651640    23.181597   -15.082368    11.505067    20.234685 \n",
      "   29.919496    26.737572     9.850308    19.331748    13.406330 \n",
      "\n",
      "C_blas: \n",
      "   27.437886   -55.120778    -2.961325    35.691352    -1.118319 \n",
      "   -9.361667    37.296970   -34.454112     5.538143     8.603580 \n",
      "    7.339524    -4.856299    -8.972083    30.239564    21.464767 \n",
      "  -41.651640    23.181597   -15.082368    11.505067    20.234685 \n",
      "   29.919496    26.737572     9.850308    19.331748    13.406330 \n"
     ]
    }
   ],
   "source": [
    "constexpr TI VIEW_SIZE = 5;\n",
    "using VIEW = rlt::matrix::ViewSpec<VIEW_SIZE, VIEW_SIZE>;\n",
    "auto vC = rlt::view(device, C, VIEW{}, 0, 0);\n",
    "auto vC_blas = rlt::view(device, C_blas, VIEW{}, 0, 0);\n",
    "\n",
    "std::cout << \"Result comparison (top-left \" << VIEW_SIZE << \"x\" << VIEW_SIZE << \" for brevity)\" << std::endl;\n",
    "std::cout << std::endl << \"C: \" << std::endl; \n",
    "rlt::print(device, vC);\n",
    "std::cout << std::endl << \"C_blas: \" << std::endl; \n",
    "rlt::print(device, vC_blas);"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "C++17",
   "language": "C++17",
   "name": "xcpp17"
  },
  "language_info": {
   "codemirror_mode": "text/x-c++src",
   "file_extension": ".cpp",
   "mimetype": "text/x-c++src",
   "name": "c++",
   "version": "17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}